Coursebook: Data Wrangling and Visualization


Training Objectives

This coursebook is intended for participants who have completed the preceding courses in the Data Science in Python Specialization. This is the third course, Data Wrangling and Visualization.

The coursebook focuses on:

The final part of this course is a Graded Assignment, where you are expected to apply everything you've learned to a new dataset and attempt the given questions.

Data Visualization

Libraries

Use pip install <library_name> to install any of the libraries listed below that are not already installed on your machine. Then load the libraries into your workspace with import statements:

Datasets

Employee promotion dataset published on Kaggle

Information about the dataset:

Data Preprocessing

Don't forget to check the data types and change them where needed.
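As a minimal sketch of that check, the block below uses a tiny made-up table rather than the real dataset. The column names follow the ones mentioned in this coursebook, and the chosen target types are only an assumption about what is appropriate.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has many more rows and columns.
promotion = pd.DataFrame({
    "department": ["Technology", "Finance", "Technology"],
    "is_promoted": ["Yes", "No", "No"],
    "length_of_service": ["3", "7", "2"],  # stored as text on purpose
})

# Step 1: inspect the current data types.
print(promotion.dtypes)

# Step 2: convert repeated labels to 'category' and numeric text to integers.
promotion = promotion.astype({
    "department": "category",
    "is_promoted": "category",
    "length_of_service": "int64",
})
print(promotion.dtypes)
```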

Plotly Python Graphing Library

Plotly's Python graphing library makes interactive, publication-quality graphs. It can produce line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes charts, polar charts, and bubble charts.

The plotly.express module (usually imported as px) contains functions that can create entire figures at once, and is referred to as Plotly Express or PX. Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures. Every Plotly Express function uses graph objects internally and returns a plotly.graph_objects.Figure instance.

Introduction to plotly.express

Your First Visualization

As our first visualization, let's plot the number of promoted employees (is_promoted == 'Yes') in each department and see how it helps us compare promotions across departments.

First, we need to subset our data so that it contains only employees whose promotion status is 'Yes':

Then we can further preprocess our data into a more appropriate format for our visualization:
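The two steps above could be sketched as follows. The data is a small made-up sample, and the names promoted and promoted_per_dept are hypothetical; the original notebook may have used a different aggregation.

```python
import pandas as pd

# Made-up sample with the two columns this step needs.
promotion = pd.DataFrame({
    "department": ["Technology", "Finance", "Technology", "HR", "Finance", "Technology"],
    "is_promoted": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

# Step 1: keep only the promoted employees.
promoted = promotion[promotion["is_promoted"] == "Yes"]

# Step 2: count promoted employees per department, largest first.
promoted_per_dept = (
    promoted.groupby("department").size()
    .sort_values(ascending=False)
    .rename("n_promoted")
)
print(promoted_per_dept)
```

Departments without any promoted employee (HR in this sample) drop out of the result, because they were filtered away in step 1.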

Now, let's create a plotly figure using px.bar():

[Figure: bar plot of the number of promoted employees per department]

The plot above was produced by passing our data to the px.bar() function as an argument. It returns a bar plot with the index on the x-axis and the values on the y-axis.

We can refine the plot iteratively until the visualization suits our purpose. For example, to rename the x- and y-axis titles, we can add the labels parameter.

[Figure: the same bar plot with renamed axis titles]

The labels parameter: by default, column names are used in the figure for axis titles, legend entries, and hovers. This parameter allows them to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired labels to be displayed.

So, by now, we have seen how flexible it is to draw a visualization using px; feel free to be creative here!

For now, let's move on to basic visual enhancements.

Basic Visual Enhancement

As we demonstrated earlier, plotly.express parameters are highly customizable, but that is also its caveat: sometimes the number of options is overwhelming! So, in this part, we will show some basic parts of a plotly.express figure that you can add or edit.

Let's start by adding some parameters and labels to make our visualization clearer:

[Figure: the bar plot with additional parameters and labels]

As we can see, the plot gets clearer as we adjust more of its parts.

Many Ways to Visualize a Context

Visualization is a powerful way to deliver context from our data. If we choose a good way to communicate that context, our audience will get the insight that we want to deliver.

In this part, we will cover some basic visualization contexts:

Categorical Ranking

Categorical ranking is one of the most basic ways to communicate how the levels of a categorical variable differ in terms of a numerical output.

We will cover the basic way to build a categorical ranking and how to break the insight down further using the promotion dataset.

General Ranking

To visualize a categorical ranking, we can use a bar plot to show the differences in magnitude between the levels of our categorical variable.

For a start, let's look at the ranking of departments in terms of number of employees.

We will start by aggregating the data using groupby:

Syntax:

df.groupby([COLUMNS_TO_GROUP]).AGGFUNC()[[VALUES]]
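Applied to a small made-up sample, the syntax above might look like this. The employee_id column and the n_employee name are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample: one row per employee.
promotion = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "department": ["Technology", "Finance", "Technology", "HR", "Finance", "Technology"],
})

# df.groupby([COLUMNS_TO_GROUP]).AGGFUNC()[[VALUES]]
dept_rank = (
    promotion.groupby(["department"]).count()[["employee_id"]]
    .rename(columns={"employee_id": "n_employee"})
    .sort_values("n_employee", ascending=False)  # ranking: largest first
    .reset_index()                               # department back to a column
)
print(dept_rank)
```

Sorting before plotting is what turns a plain bar plot into a ranking, since the bars then appear in order of magnitude.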

Now, take a look at the visualization below:

[Figure: bar plot ranking departments by number of employees]

A simple bar plot, if visualized properly, is really powerful for categorical ranking. Our plot is a clear example of that: we can already see which department ranks highest or lowest, and we can also see the big picture of the ranking in terms of number of employees.

We should also note how some additional information can help a lot in making our visualization more informative.

Breaking Down a Ranking

As in the previous example, visualizing a categorical ranking can help us gain some insight. But often we need to break the ranking down in order to gain more insight.

Let’s try, for example, to re-visualize the ranking, this time broken down by promotion status:

Let's take a look at the visualization below:

You can browse the available color palettes in the Plotly documentation.

[Figure: stacked bar plot of employees per department, broken down by promotion status]

This bar plot variation is called a stacked bar plot. It helps us see the overall ranking while also showing the share of another categorical variable within each level. For example, while the original ranking is still visible, we now gain the insight that the Sales & Marketing department has the highest proportion of promotions relative to its number of employees. This is a crucial finding in this context.

Sometimes, we also need to see the exact differences within the breakdown. For this purpose, we can use barmode='group' to place the bars side by side and compare their heights:

[Figure: grouped bar plot of employees per department, broken down by promotion status]

Data Distribution

A data distribution is a slightly statistical way to see how our numerical data is distributed within our sample dataset. One thing to note for this kind of visualization: it only works for continuous numerical variables.

In this part, we will discuss how to properly visualize and interpret distributions using the promotion dataset.

Simple Numerical Distribution

The most straightforward way to visualize a data distribution is a histogram. A histogram is made by grouping the values of a numerical variable into bins, each with its own range.

For example, let’s see how length_of_service is distributed between the employees if we use 30 bins:

[Figure: histogram of length_of_service with 30 bins]

The visualization above shows us that the most frequent length of service is around 2 to 3 years.

Breaking Down Numerical Distribution

Breaking down a numerical distribution can also give us more insight into our data; it is very useful for comparing how the distribution differs between the levels of a categorical variable.

There are several ways to break down a data distribution, and the right choice depends on how many levels our categorical variable has.

If the categorical variable has only two levels, it is straightforward to map it to the histogram's color parameter:

[Figure: histogram of length_of_service colored by a two-level categorical variable]

Knowledge Check: Bar plot vs Histogram

After creating the two plots above, what is the difference between a bar plot and a histogram?

As we can see from our visualization, a correct breakdown can explain more about the distribution of our numerical data.

But if we have more than two levels, a histogram quickly gets messy. In that case, we can use a box plot instead to show the data distribution and its breakdown.

Let’s try an example by splitting the length of service distribution by department and promotion status:

[Figure: box plot of the distribution by department and promotion status]

As we can see from our plot, breaking the distribution down by more categorical variables helps us uncover more insight. For example, we can see a strong difference in the length of service distribution for the Finance and R&D departments, but the difference is weaker for the other departments.

Correlation Between Data

Correlation is another popular context that we can explore. It helps us examine the relationship between the variation of two variables.

In this part, we will discuss how to build a proper correlation visualization and interpret it using the promotion dataset.

Between Continuous Variables

The most common form of correlation is between continuous numerical variables. It shows us whether the two variables share a pattern of variation, which is often very insightful for explaining our dataset.

For example, let’s try to visualize how length of service relates to the average training score for employees in the Technology department.

[Figure: scatter plot for employees in the Technology department]

As we can see from the plot above, there is no clear relationship between average training score and length of service: however long an employee stays at the company, their average score does not depend on their length of service.

It is also common to break down the information in a scatter plot. This is relatively difficult to achieve, since a scatter plot gets messy very easily. The most straightforward way is to draw one panel per category; in plotly.express, we can achieve this using the facet parameters (facet_col and facet_row).

Let’s try to break down our scatter plot by promotion status.

[Figure: scatter plot faceted by promotion status]

Let’s try to answer that using the KPIs-met status.

[Figure: scatter plot broken down by KPIs-met status]

As we can see, adding more features to our scatter plot can give more explanation.

Time-based Evolution

Time-based evolution, or simply time series analysis, is widely used in every aspect of business and other technical domains. It helps us see the dynamics of a numerical value along the time dimension more clearly.

[Figure: time series plot of the frequency of employee join dates]

As we can see, the time series plot gives us clearer insight into the frequency of employee join dates. For interpretation, we should focus on the time series components: trend, seasonality, and shocks. First of all, we can conclude that our time series doesn’t have a significant seasonality pattern, so we can ignore that component.


Knowledge Check: Using Plot

Consider the following data frame:

Summary

The coursebook covers many aspects of plotting, including using visualization libraries such as plotly.express and other supporting libraries. I hope you've managed to get a good grasp of the plotting philosophy behind plotly.express, and have built a few visualizations with it yourself!

Happy coding!